The main goal of this research is to study and analyze data from a strategy game called DoTA 2, in which two teams of exactly 5 players each strive to destroy the enemy’s base, and the team with more resources usually wins. We study the relationship between various statistics that can be gathered during a game and analyze how they influence the amount of resources each player has gathered by the end. Our aim is to find out which parameters are more significant than others and, based on this knowledge, to build a model that predicts the amount of money each player finishes a game with. This may mean little to those who do not play the game, but it can be interesting not only for players and bookmakers but also for anyone curious to discover a new field.
We used a dataset that can be downloaded from https://www.kaggle.com/devinanzelmo/dota-2-matches
library(moments)
library("ggpubr")
## Loading required package: ggplot2
library("Hmisc")
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
# Reading the data
matches <- read.csv("./data/match.csv")
players <- read.csv("./data/players.csv")
# Now, merge two data frames by match index
df <- merge(players, matches,by="match_id")
df
We filter out matches that last less than 20 or more than 80 minutes, because 99% of the games last from 20 to 80 minutes. As written, the filter below also keeps only rows with game_mode equal to 22 and hero_id equal to 67, and it discards players whose leaver_status is non-zero (they abandoned the game). After that, the win column records whether a player won or lost a specific game, and duration is converted from seconds to minutes.
df_filtered <- df[(((df$duration >= 20*60) & (df$duration <= 80*60)) & (df$game_mode == 22) & ((df$hero_id == 67) & (df$leaver_status == 0))),]
# Sum the columns that count how many clicks a player made during the game (positions 41:73, which is 33 columns as indexed here) and add the total to the data frame as unitSum
df_filtered <- data.frame(df_filtered, unitSum=rowSums(df_filtered[41:73], na.rm=TRUE))
print(nrow(df_filtered))
## [1] 6359
df_filtered$win <- (df_filtered$player_slot <= 4 & df_filtered$radiant_win == "True") | (df_filtered$player_slot >= 128 & df_filtered$radiant_win == "False")
df_filtered$duration <- df_filtered$duration / 60
df_filtered
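To illustrate the filtering and the win flag, here is a minimal sketch on a made-up four-row data frame. The toy values are assumptions and are not taken from the dataset; the sketch relies on the same convention as the code above, namely that player_slot values 0 to 4 belong to Radiant and values of 128 and above belong to Dire.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  player_slot   = c(0, 4, 128, 132),
  radiant_win   = c("True", "True", "True", "False"),
  duration      = c(1500, 900, 3000, 5400),  # seconds
  leaver_status = c(0, 0, 0, 1)
)

# Keep games of 20 to 80 minutes played by players who did not leave
keep <- toy$duration >= 20 * 60 & toy$duration <= 80 * 60 & toy$leaver_status == 0
toy_kept <- toy[keep, ]

# A player wins if the player's team matches the winning side
is_radiant <- toy_kept$player_slot <= 4
toy_kept$win <- (is_radiant & toy_kept$radiant_win == "True") |
  (!is_radiant & toy_kept$radiant_win == "False")

print(toy_kept)
```

Row 2 is dropped because the game is shorter than 20 minutes, and row 4 is dropped because the player left the game.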
In DoTA 2, net worth is calculated as the sum of all gold that a single player gained from various sources minus all gold that player has lost during the game. Finally, we remove unnecessary columns, and are ready for feature selection.
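# each player gains exactly 95 gold per minute
# NOTE: duration was already converted to minutes above, so the extra "/ 60" below
# scales it to hours; the results shown in this report were produced with the line as written
df_filtered$gold_time = round(95 * (df_filtered$duration / 60), 0)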
df_filtered <- data.frame(df_filtered, net_worth=rowSums(df_filtered[c(30:39, 87:87)], na.rm=TRUE))
# Filtering out the main parameters, and preparing data frame for feature selection
df_filtered = subset(df_filtered, select = c(match_id, net_worth,account_id,hero_id,win,xp_per_min,kills,deaths,assists,denies,last_hits,hero_damage,tower_damage,level,duration,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,unitSum))
colnames(df_filtered)
## [1] "match_id" "net_worth"
## [3] "account_id" "hero_id"
## [5] "win" "xp_per_min"
## [7] "kills" "deaths"
## [9] "assists" "denies"
## [11] "last_hits" "hero_damage"
## [13] "tower_damage" "level"
## [15] "duration" "tower_status_radiant"
## [17] "tower_status_dire" "barracks_status_dire"
## [19] "barracks_status_radiant" "unitSum"
Now we are left with 20 columns and 6359 rows.
hist.data.frame(df_filtered["duration"])
lapply(df_filtered["duration"], mean, na.rm = TRUE)
## $duration
## [1] 42.44989
print(paste("Duration skewness:", skewness(df_filtered["duration"])))
## [1] "Duration skewness: 0.561157313404682"
print(paste("Duration kurtosis:", kurtosis(df_filtered["duration"])))
## [1] "Duration kurtosis: 3.16686923125807"
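Here we computed the mean, skewness and kurtosis of the match duration. Skewness > 0 shows that the right tail is heavier, so unusually long matches stretch the distribution further than unusually short ones. The kurtosis function from the moments package returns Pearson’s kurtosis, for which a normal distribution gives 3 (not 0). The value of about 3.17 is slightly above 3, so the tails are slightly heavier than in a normal distribution, and predictions involving duration may be a little less precise because of a greater number of outliers.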
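The two statistics can also be written out by hand as ratios of central moments. The sketch below uses only base R; the helper names pearson_skewness and pearson_kurtosis are hypothetical and are not part of the moments package. It shows that a normal sample gives a kurtosis near 3, which is the reference value used in the interpretation above.

```r
# Skewness as the third central moment divided by the variance to the power 1.5
pearson_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Pearson's (non-excess) kurtosis: fourth central moment over squared variance
pearson_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

set.seed(42)
normal_sample <- rnorm(1e5)
print(pearson_skewness(normal_sample))  # close to 0
print(pearson_kurtosis(normal_sample))  # close to 3
```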
hist.data.frame(df_filtered["net_worth"])
lapply(df_filtered["net_worth"], mean, na.rm = TRUE)
## $net_worth
## [1] 15018
print(paste("Net worth skewness:", skewness(df_filtered["net_worth"])))
## [1] "Net worth skewness: 0.326249798842112"
print(paste("Net worth kurtosis:", kurtosis(df_filtered["net_worth"])))
## [1] "Net worth kurtosis: 2.8559977248955"
The same statistics for net worth are closer to those of a normal distribution: skewness (about 0.33) is closer to 0, and kurtosis (about 2.86) is slightly below 3, which indicates slightly lighter tails than normal.
We want to test whether the duration of the game and the net worth of a single player are linearly related, using \(\alpha=0.05\), where \(\rho\) denotes the population correlation coefficient.
\(H_0: \rho=0\) - the final net worth of a player is not linearly related to the duration of the game
\(H_1: \rho\ne0\) - there is a non-zero linear relation between those two parameters
Firstly, let’s take a look at the relation on the following scatterplot:
ggscatter(df_filtered, x = "duration", y = "net_worth", color = "black",
          cor.coef = TRUE, add = "reg.line", conf.int = TRUE,
          cor.method = "pearson", xlab = "duration(minutes)",
          ylab = "net worth(gold)", size = 1, alpha = 0.75,
          title = "Relation between the duration of the game and player net worth")
## `geom_smooth()` using formula 'y ~ x'
From the plot above, we can see that the relationship between the two parameters is approximately linear.
As we have 6359 observations, the test has \(df=6359-2=6357\) degrees of freedom. Each observation of net_worth has a corresponding duration value, there are no outliers that could significantly skew our results, and the shape of the scatterplot is roughly linear (see the graph above). Therefore, we use Pearson’s correlation test to measure the linear relationship between our parameters.
Pearson’s correlation coefficient is \(r=\frac{\sum (x-m_x)(y-m_y)}{\sqrt{\sum(x-m_x)^2\sum(y-m_y)^2}}\), where \(m_x\) and \(m_y\) are the sample means. The p-value is obtained from the statistic \(t=r\sqrt{n-2}/\sqrt{1-r^2}\), which follows a t distribution with 6357 degrees of freedom when the true correlation is zero.
Pearson correlation test
res <- cor.test(df_filtered$net_worth, df_filtered$duration,
method = "pearson")
res
##
## Pearson's product-moment correlation
##
## data: df_filtered$net_worth and df_filtered$duration
## t = 63.698, df = 6357, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6089388 0.6389524
## sample estimates:
## cor
## 0.6241758
As we can see, the p-value (\(< 2.2 \times 10^{-16}\)) is far less than our \(\alpha=0.05\), so we reject the null hypothesis and conclude that net worth and duration are significantly correlated, with a correlation coefficient of about 0.62. This is intuitive: the longer a game lasts, the more time a player has to develop his hero and to gather resources and money.
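The numbers reported by cor.test can be reproduced by hand from the correlation coefficient and the sample size. The sketch below plugs in the values printed above (r = 0.6241758, n = 6359); the confidence interval uses the Fisher z-transformation, which is the method documented for cor.test with method = "pearson".

```r
r <- 0.6241758  # sample correlation reported above
n <- 6359       # number of observations

# t statistic with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval via the Fisher z-transformation
z <- atanh(r)
half_width <- qnorm(0.975) / sqrt(n - 3)
conf_int <- tanh(c(z - half_width, z + half_width))

print(t_stat)
print(p_value)
print(conf_int)
```

The t statistic comes out at roughly 63.7 and the interval at roughly 0.609 to 0.639, in line with the output above.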
source("http://www.sthda.com/upload/rquery_cormat.r")
# net_worth (column 2) plus columns 5 to 20
test_correlation <- df_filtered[c(2, 5:20)]
col<- colorRampPalette(c("blue", "white", "red"))(20)
rquery.cormat(test_correlation, type="full")
## corrplot 0.84 loaded
## $r
## deaths tower_status_dire barracks_status_dire
## deaths 1.0000 0.0280 0.0270
## tower_status_dire 0.0280 1.0000 0.8900
## barracks_status_dire 0.0270 0.8900 1.0000
## tower_status_radiant -0.0870 -0.9500 -0.8500
## barracks_status_radiant -0.0830 -0.8800 -0.7900
## denies -0.1400 -0.0030 0.0052
## win -0.4200 -0.0360 0.0035
## tower_damage -0.3400 0.0130 0.0120
## kills -0.0580 -0.0150 0.0110
## hero_damage 0.0660 -0.0140 0.0190
## xp_per_min -0.2900 0.0300 0.0490
## level 0.0610 0.0230 0.0570
## unitSum 0.0430 -0.0059 0.0120
## assists 0.1100 -0.0150 0.0180
## duration 0.3900 0.0026 0.0360
## net_worth -0.2300 -0.0034 0.0190
## last_hits -0.0087 0.0089 0.0330
## tower_status_radiant barracks_status_radiant denies
## deaths -0.0870 -0.0830 -0.1400
## tower_status_dire -0.9500 -0.8800 -0.0030
## barracks_status_dire -0.8500 -0.7900 0.0052
## tower_status_radiant 1.0000 0.9300 -0.0076
## barracks_status_radiant 0.9300 1.0000 -0.0019
## denies -0.0076 -0.0019 1.0000
## win 0.0140 0.0280 0.0800
## tower_damage -0.0680 -0.0730 0.1700
## kills -0.0620 -0.0470 0.1200
## hero_damage -0.0950 -0.0800 0.1600
## xp_per_min -0.0500 -0.0350 0.1700
## level -0.1300 -0.1100 0.1300
## unitSum -0.0750 -0.0630 0.1700
## assists -0.0780 -0.0570 0.0300
## duration -0.1500 -0.1400 0.0360
## net_worth -0.0960 -0.0810 0.2100
## last_hits -0.1300 -0.1100 0.2800
## win tower_damage kills hero_damage xp_per_min
## deaths -0.4200 -0.340 -0.058 0.066 -0.290
## tower_status_dire -0.0360 0.013 -0.015 -0.014 0.030
## barracks_status_dire 0.0035 0.012 0.011 0.019 0.049
## tower_status_radiant 0.0140 -0.068 -0.062 -0.095 -0.050
## barracks_status_radiant 0.0280 -0.073 -0.047 -0.080 -0.035
## denies 0.0800 0.170 0.120 0.160 0.170
## win 1.0000 0.670 0.450 0.370 0.540
## tower_damage 0.6700 1.000 0.600 0.560 0.550
## kills 0.4500 0.600 1.000 0.850 0.660
## hero_damage 0.3700 0.560 0.850 1.000 0.590
## xp_per_min 0.5400 0.550 0.660 0.590 1.000
## level 0.4200 0.530 0.720 0.810 0.780
## unitSum 0.1200 0.240 0.320 0.420 0.210
## assists 0.4700 0.420 0.440 0.580 0.440
## duration 0.1200 0.270 0.470 0.680 0.230
## net_worth 0.6000 0.730 0.780 0.810 0.680
## last_hits 0.2600 0.480 0.540 0.710 0.440
## level unitSum assists duration net_worth last_hits
## deaths 0.061 0.0430 0.110 0.3900 -0.2300 -0.0087
## tower_status_dire 0.023 -0.0059 -0.015 0.0026 -0.0034 0.0089
## barracks_status_dire 0.057 0.0120 0.018 0.0360 0.0190 0.0330
## tower_status_radiant -0.130 -0.0750 -0.078 -0.1500 -0.0960 -0.1300
## barracks_status_radiant -0.110 -0.0630 -0.057 -0.1400 -0.0810 -0.1100
## denies 0.130 0.1700 0.030 0.0360 0.2100 0.2800
## win 0.420 0.1200 0.470 0.1200 0.6000 0.2600
## tower_damage 0.530 0.2400 0.420 0.2700 0.7300 0.4800
## kills 0.720 0.3200 0.440 0.4700 0.7800 0.5400
## hero_damage 0.810 0.4200 0.580 0.6800 0.8100 0.7100
## xp_per_min 0.780 0.2100 0.440 0.2300 0.6800 0.4400
## level 1.000 0.4600 0.650 0.7800 0.8300 0.7600
## unitSum 0.460 1.0000 0.330 0.5100 0.4400 0.5100
## assists 0.650 0.3300 1.000 0.5800 0.5600 0.4400
## duration 0.780 0.5100 0.580 1.0000 0.6200 0.7600
## net_worth 0.830 0.4400 0.560 0.6200 1.0000 0.8300
## last_hits 0.760 0.5100 0.440 0.7600 0.8300 1.0000
##
## $p
## deaths tower_status_dire barracks_status_dire
## deaths 0.0e+00 0.0240 3.2e-02
## tower_status_dire 2.4e-02 0.0000 0.0e+00
## barracks_status_dire 3.2e-02 0.0000 0.0e+00
## tower_status_radiant 4.8e-12 0.0000 0.0e+00
## barracks_status_radiant 2.9e-11 0.0000 0.0e+00
## denies 3.0e-30 0.8100 6.8e-01
## win 3.2e-266 0.0039 7.8e-01
## tower_damage 1.1e-168 0.3000 3.5e-01
## kills 3.4e-06 0.2300 3.7e-01
## hero_damage 1.6e-07 0.2700 1.3e-01
## xp_per_min 2.9e-122 0.0160 7.9e-05
## level 1.0e-06 0.0690 6.1e-06
## unitSum 5.9e-04 0.6400 3.3e-01
## assists 1.8e-17 0.2200 1.6e-01
## duration 2.1e-227 0.8400 3.9e-03
## net_worth 8.0e-78 0.7800 1.3e-01
## last_hits 4.9e-01 0.4800 9.3e-03
## tower_status_radiant barracks_status_radiant denies
## deaths 4.8e-12 2.9e-11 3.0e-30
## tower_status_dire 0.0e+00 0.0e+00 8.1e-01
## barracks_status_dire 0.0e+00 0.0e+00 6.8e-01
## tower_status_radiant 0.0e+00 0.0e+00 5.4e-01
## barracks_status_radiant 0.0e+00 0.0e+00 8.8e-01
## denies 5.4e-01 8.8e-01 0.0e+00
## win 2.5e-01 2.6e-02 1.4e-10
## tower_damage 6.4e-08 5.6e-09 2.1e-40
## kills 7.6e-07 1.7e-04 1.4e-21
## hero_damage 2.6e-14 1.7e-10 6.7e-37
## xp_per_min 7.0e-05 5.0e-03 4.6e-42
## level 1.7e-24 6.6e-18 2.5e-25
## unitSum 2.2e-09 4.0e-07 2.5e-44
## assists 5.5e-10 4.7e-06 1.7e-02
## duration 1.4e-33 2.2e-27 3.8e-03
## net_worth 1.9e-14 9.9e-11 2.1e-63
## last_hits 1.2e-23 1.1e-18 1.2e-112
## win tower_damage kills hero_damage xp_per_min
## deaths 3.2e-266 1.1e-168 3.4e-06 1.6e-07 2.9e-122
## tower_status_dire 3.9e-03 3.0e-01 2.3e-01 2.7e-01 1.6e-02
## barracks_status_dire 7.8e-01 3.5e-01 3.7e-01 1.3e-01 7.9e-05
## tower_status_radiant 2.5e-01 6.4e-08 7.6e-07 2.6e-14 7.0e-05
## barracks_status_radiant 2.6e-02 5.6e-09 1.7e-04 1.7e-10 5.0e-03
## denies 1.4e-10 2.1e-40 1.4e-21 6.7e-37 4.6e-42
## win 0.0e+00 0.0e+00 1.7e-309 1.8e-204 0.0e+00
## tower_damage 0.0e+00 0.0e+00 0.0e+00 0.0e+00 0.0e+00
## kills 1.7e-309 0.0e+00 0.0e+00 0.0e+00 0.0e+00
## hero_damage 1.8e-204 0.0e+00 0.0e+00 0.0e+00 0.0e+00
## xp_per_min 0.0e+00 0.0e+00 0.0e+00 0.0e+00 0.0e+00
## level 3.1e-274 0.0e+00 0.0e+00 0.0e+00 0.0e+00
## unitSum 2.4e-21 4.3e-81 2.5e-147 9.5e-267 2.2e-64
## assists 0.0e+00 1.7e-266 3.0e-293 0.0e+00 1.8e-301
## duration 1.7e-22 8.2e-110 0.0e+00 0.0e+00 1.7e-75
## net_worth 0.0e+00 0.0e+00 0.0e+00 0.0e+00 0.0e+00
## last_hits 2.7e-98 0.0e+00 0.0e+00 0.0e+00 5.7e-307
## level unitSum assists duration net_worth
## deaths 1.000000e-06 5.900000e-04 1.8e-17 2.1e-227 8.0e-78
## tower_status_dire 6.900000e-02 6.400000e-01 2.2e-01 8.4e-01 7.8e-01
## barracks_status_dire 6.100000e-06 3.300000e-01 1.6e-01 3.9e-03 1.3e-01
## tower_status_radiant 1.700000e-24 2.200000e-09 5.5e-10 1.4e-33 1.9e-14
## barracks_status_radiant 6.600000e-18 4.000000e-07 4.7e-06 2.2e-27 9.9e-11
## denies 2.500000e-25 2.500000e-44 1.7e-02 3.8e-03 2.1e-63
## win 3.100000e-274 2.400000e-21 0.0e+00 1.7e-22 0.0e+00
## tower_damage 0.000000e+00 4.300000e-81 1.7e-266 8.2e-110 0.0e+00
## kills 0.000000e+00 2.500000e-147 3.0e-293 0.0e+00 0.0e+00
## hero_damage 0.000000e+00 9.500000e-267 0.0e+00 0.0e+00 0.0e+00
## xp_per_min 0.000000e+00 2.200000e-64 1.8e-301 1.7e-75 0.0e+00
## level 0.000000e+00 9.881313e-324 0.0e+00 0.0e+00 0.0e+00
## unitSum 9.881313e-324 0.000000e+00 6.2e-162 0.0e+00 3.7e-302
## assists 0.000000e+00 6.200000e-162 0.0e+00 0.0e+00 0.0e+00
## duration 0.000000e+00 0.000000e+00 0.0e+00 0.0e+00 0.0e+00
## net_worth 0.000000e+00 3.700000e-302 0.0e+00 0.0e+00 0.0e+00
## last_hits 0.000000e+00 0.000000e+00 7.6e-306 0.0e+00 0.0e+00
## last_hits
## deaths 4.9e-01
## tower_status_dire 4.8e-01
## barracks_status_dire 9.3e-03
## tower_status_radiant 1.2e-23
## barracks_status_radiant 1.1e-18
## denies 1.2e-112
## win 2.7e-98
## tower_damage 0.0e+00
## kills 0.0e+00
## hero_damage 0.0e+00
## xp_per_min 5.7e-307
## level 0.0e+00
## unitSum 0.0e+00
## assists 7.6e-306
## duration 0.0e+00
## net_worth 0.0e+00
## last_hits 0.0e+00
##
## $sym
## deaths tower_status_dire barracks_status_dire
## deaths 1
## tower_status_dire 1
## barracks_status_dire + 1
## tower_status_radiant * +
## barracks_status_radiant + ,
## denies
## win .
## tower_damage .
## kills
## hero_damage
## xp_per_min
## level
## unitSum
## assists
## duration .
## net_worth
## last_hits
## tower_status_radiant barracks_status_radiant denies win
## deaths
## tower_status_dire
## barracks_status_dire
## tower_status_radiant 1
## barracks_status_radiant * 1
## denies 1
## win 1
## tower_damage ,
## kills .
## hero_damage .
## xp_per_min .
## level .
## unitSum
## assists .
## duration
## net_worth .
## last_hits
## tower_damage kills hero_damage xp_per_min level unitSum
## deaths
## tower_status_dire
## barracks_status_dire
## tower_status_radiant
## barracks_status_radiant
## denies
## win
## tower_damage 1
## kills . 1
## hero_damage . + 1
## xp_per_min . , . 1
## level . , + , 1
## unitSum . . . 1
## assists . . . . , .
## duration . , , .
## net_worth , , + , + .
## last_hits . . , . , .
## assists duration net_worth last_hits
## deaths
## tower_status_dire
## barracks_status_dire
## tower_status_radiant
## barracks_status_radiant
## denies
## win
## tower_damage
## kills
## hero_damage
## xp_per_min
## level
## unitSum
## assists 1
## duration . 1
## net_worth . , 1
## last_hits . , + 1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
Based on the correlation matrix above, we select the following parameters as predictors of net worth: deaths, win, tower_damage, kills, xp_per_min, level, unitSum, assists, duration, last_hits. Note that hero_damage is also strongly correlated with net worth (0.81), but it is not included in the model below.
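To make the selection easier to check, the correlations with net_worth can be ranked by absolute value. The sketch below copies the net_worth row of the $r matrix printed above into a named vector; the 0.6 cut-off for "strong" is an illustrative assumption and not a rule used in this report.

```r
# Correlations with net_worth, copied from the $r output above
r_net_worth <- c(
  deaths = -0.23, tower_status_dire = -0.0034, barracks_status_dire = 0.019,
  tower_status_radiant = -0.096, barracks_status_radiant = -0.081,
  denies = 0.21, win = 0.60, tower_damage = 0.73, kills = 0.78,
  hero_damage = 0.81, xp_per_min = 0.68, level = 0.83, unitSum = 0.44,
  assists = 0.56, duration = 0.62, last_hits = 0.83
)

# Rank by absolute correlation, strongest first
ranked <- r_net_worth[order(abs(r_net_worth), decreasing = TRUE)]

# Names of parameters above the (assumed) 0.6 threshold
strong <- names(ranked)[abs(ranked) >= 0.6]

print(ranked)
print(strong)
```

The tower and barracks status columns end up at the bottom of the ranking, with absolute correlations below 0.1.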
model.lm = lm(net_worth~ deaths+win+tower_damage+kills+xp_per_min+level+unitSum+assists+duration+last_hits, data=df_filtered)
summary(model.lm)
##
## Call:
## lm(formula = net_worth ~ deaths + win + tower_damage + kills +
## xp_per_min + level + unitSum + assists + duration + last_hits,
## data = df_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9243.7 -1025.1 -95.0 914.6 12228.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.578e+03 1.783e+02 -8.849 < 2e-16 ***
## deaths -3.075e+02 9.802e+00 -31.368 < 2e-16 ***
## winTRUE 2.539e+03 7.223e+01 35.146 < 2e-16 ***
## tower_damage 4.940e-01 2.149e-02 22.992 < 2e-16 ***
## kills 3.227e+02 5.912e+00 54.585 < 2e-16 ***
## xp_per_min -2.867e+00 1.064e+00 -2.694 0.00708 **
## level 3.646e+02 5.189e+01 7.026 2.34e-12 ***
## unitSum 5.660e-03 1.187e-02 0.477 0.63338
## assists 6.342e+00 4.476e+00 1.417 0.15654
## duration 2.055e+01 1.414e+01 1.453 0.14618
## last_hits 2.889e+01 4.303e-01 67.143 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1794 on 6348 degrees of freedom
## Multiple R-squared: 0.9378, Adjusted R-squared: 0.9377
## F-statistic: 9577 on 10 and 6348 DF, p-value: < 2.2e-16
The residual standard error is 1794 gold, and the coefficient of determination \(R^2\) is 0.9378, which means that the model explains about 94% of the variance in net worth on the data it was fitted to. The small p-value of the F-test (< 2.2e-16) indicates that the predictors, taken together, are significantly related to the net worth of a player. Individually, unitSum, assists and duration are not significant at \(\alpha=0.05\) once the other predictors are in the model.
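For readers who want to see where these two summary numbers come from, here is a minimal sketch on simulated data. The data frame sim and its coefficients are made up for illustration and are unrelated to the real dataset. It computes \(R^2\) and the residual standard error from the residuals and compares them with the values reported by summary.

```r
set.seed(1)
n <- 200

# Simulated predictors and response (hypothetical values)
sim <- data.frame(last_hits = runif(n, 0, 500), kills = rpois(n, 8))
sim$net_worth <- 1000 + 30 * sim$last_hits + 300 * sim$kills + rnorm(n, sd = 1500)

fit <- lm(net_worth ~ last_hits + kills, data = sim)

# R^2 = 1 - RSS / TSS
rss <- sum(residuals(fit)^2)
tss <- sum((sim$net_worth - mean(sim$net_worth))^2)
r_squared <- 1 - rss / tss

# Residual standard error = sqrt(RSS / residual degrees of freedom)
rse <- sqrt(rss / df.residual(fit))

print(c(r_squared = r_squared, from_summary = summary(fit)$r.squared))
print(c(rse = rse, from_summary = summary(fit)$sigma))
```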
Let’s test our model by applying it to the average game of our friend:
friends_data <-data.frame(deaths=7,win=FALSE,kills=12,tower_damage=2900,xp_per_min=721,level=28,unitSum=2000,assists=14,duration=58,last_hits=400)
predict(model.lm, newdata=friends_data, interval="prediction")
## fit lwr upr
## 1 22564.33 19040.99 26087.67
In reality, his net worth at the end of the game was exactly 24000, which lies inside the 95% prediction interval (about 19041 to 26088) and is about 1436 gold (roughly 6%) above the point prediction, so the model did a good job on this example.
As a result, we found approximately linear relationships between net worth and the majority of the parameters, and using the Pearson correlation test and the correlation matrix we identified the parameters that are most strongly related to net worth. After making sure that a linear regression model is applicable to our task, we fitted a multiple linear regression model. It gave a reasonable prediction for one real game that it had not seen before.
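The comparison between the prediction and the observed value can be made explicit with a few lines of R. The sketch below reuses the three numbers printed by predict above and the observed net worth of 24000 quoted in the text.

```r
# Values printed by predict(..., interval = "prediction") above
pred <- c(fit = 22564.33, lwr = 19040.99, upr = 26087.67)
observed <- 24000

# Is the observed value inside the prediction interval?
inside <- unname(observed >= pred["lwr"] & observed <= pred["upr"])

# Absolute and relative error of the point prediction
abs_error <- unname(abs(observed - pred["fit"]))
rel_error <- abs_error / observed

print(inside)
print(abs_error)
print(rel_error)
```

A single game is only an illustration; a held-out set of games would be needed to judge the predictive accuracy of the model more generally.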